2023.4.28 RAC Seminar

Material read

REINFORCEMENT LEARNING (DQN) TUTORIAL - PyTorch

Keywords

Named tuples

zip function

2025.8.10 Image resizing [cv2]

Lambda expressions, map function

Variable-length arguments

tensor.max function

State-value function, action-value function, policy, action

ーーー

Basics it is too late to ask about (1)

We’ll be using experience replay memory for training our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.

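Replay memory is used to train the DQN.

It is a buffer that stores the state transitions observed by the agent so that they can be reused later.

Building each batch by sampling randomly from the buffer decorrelates the transitions in the batch, which has been shown to greatly stabilize DQN training.

Two classes are used.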

Transition - a named tuple representing a single transition in our environment. It essentially maps (state, action) pairs to their (next_state, reward) result, with the state being the screen difference image as described later on.

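The Transition class is a named tuple representing a single state transition. In essence, it maps the state and action at time $t$, $(s_t, a_t)$, to the resulting next state and reward, $(s_{t+1}, r_t)$.

Here the state $s_t$ is the screen difference image described later. (?)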

ReplayMemory - a cyclic buffer of bounded size that holds the transitions observed recently. It also implements a .sample() method for selecting a random batch of transitions for training.

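The ReplayMemory class is a fixed-length ring buffer that holds the most recently observed state transitions. It provides a sample method that draws a random batch of transitions for training.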
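The two classes described above can be sketched roughly as follows. This is a minimal sketch in the spirit of the tutorial, not a copy of it: the field order of Transition, the use of `collections.deque` with `maxlen` as the ring buffer, and the dummy integer states are assumptions made for illustration. It also touches three of the keywords listed at the top (named tuples, variable-length arguments, zip).

```python
import random
from collections import deque, namedtuple

# One transition: (state, action) -> (next_state, reward)
Transition = namedtuple("Transition", ("state", "action", "next_state", "reward"))


class ReplayMemory:
    """Bounded cyclic buffer holding the most recent transitions."""

    def __init__(self, capacity):
        # A deque with maxlen drops its oldest item once capacity is reached
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        """Store one transition; *args are the four Transition fields."""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        """Draw batch_size transitions uniformly, without replacement."""
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


memory = ReplayMemory(capacity=3)
for step_index in range(5):
    # Dummy data (assumption): integer states, action 0, reward 1.0
    memory.push(step_index, 0, step_index + 1, 1.0)

batch = memory.sample(2)
# zip(*batch) transposes a list of Transitions into one Transition of tuples
fields = Transition(*zip(*batch))
```

After five pushes into a buffer of capacity 3, only the three newest transitions remain, which is the "cyclic" behavior the notes describe.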

Now, let’s define our model. But first, let’s quickly recap what a DQN is.

recap ... to summarize

DQN algorithm

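For simplicity, the environment in this program is assumed to be deterministic, so all equations are formulated deterministically. # In other words, taking the same action in the same state always produces the same next state and reward.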

for the sake of ... for the purpose of **

Reinforcement learning also deals with environments whose transitions are stochastic.

Our aim will be to train a policy that tries to maximize the discounted, cumulative reward $R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t-t_0} r_t$, where $R_{t_0}$ is also known as the return. The discount, $\gamma$, should be a constant between 0 and 1 that ensures the sum converges. A lower $\gamma$ makes rewards from the uncertain far future less important for our agent than the ones in the near future that it can be fairly confident about. It also encourages agents to collect reward closer in time than equivalent rewards that are temporally far away in the future.

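The goal is to train a policy so as to maximize the discounted cumulative reward $R_{t_0}$ collected from the initial time $t_0$ onward.

$R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t-t_0} r_t$

Maximizing this maximizes the total discounted reward from $t_0$ onward; it does not necessarily maximize the reward at each individual time step.

The discount factor $\gamma$ is a constant between 0 and 1. The more time has elapsed since $t_0$, the more heavily a reward is discounted, because its weight $\gamma^{t-t_0}$ shrinks.

Choosing a smaller $\gamma$ means giving less weight to rewards in the uncertain far future and more weight to near-term rewards that are more likely to be obtained. It also pushes the agent to prioritize collecting rewards in the near future.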
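The effect of $\gamma$ is easy to see numerically. The sketch below computes the return for a short reward sequence; truncating the infinite sum to a finite list and the particular reward values are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], where k counts steps from the start time t_0."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Hypothetical reward sequence: a reward of 1.0 at each of four steps
rewards = [1.0, 1.0, 1.0, 1.0]

low_gamma = discounted_return(rewards, 0.5)   # 1 + 0.5 + 0.25 + 0.125 = 1.875
no_discount = discounted_return(rewards, 1.0)  # 1 + 1 + 1 + 1 = 4.0
```

With $\gamma = 0.5$ the fourth reward contributes only 0.125 to the return, whereas without discounting every reward counts equally.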

fairly ... quite, considerably

confident ... sure, certain

encourages ... urges, motivates

The main idea behind Q-learning is that if we had a function $Q^* : State \times Action \rightarrow \mathbb{R}$, that could tell us what our return would be, if we were to take an action in a given state, then we could easily construct a policy that maximizes our rewards:

$\pi^*(s) = \arg\max_a Q^*(s,a)$

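The main idea of Q-learning is as follows. If we were given a function $Q^*$ that returns the return obtained by taking a given action in a given state, we could easily construct a policy that maximizes the reward.

# For the function $Q^*$, whose value is determined by the current state $s$ and action $a$, the policy $\pi^*$ selects the action that maximizes that value.

# $\arg\max_a Q$ is the action that yields the largest Q-value. In other words, the actions are given as a set of selectable candidates, whereas the policy maps each state to the single best candidate among them.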
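The relationship between a Q function and the greedy policy can be illustrated with a small lookup table. The state names, action names, and Q-values below are hypothetical; in the tutorial the Q-values come from a neural network rather than a table.

```python
# Hypothetical Q-values for two states and two candidate actions
q_table = {
    "s0": {"left": 0.2, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.1},
}


def greedy_policy(q_table, state):
    """Return argmax_a Q(state, a): the candidate action with the largest Q-value."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


best_in_s0 = greedy_policy(q_table, "s0")
best_in_s1 = greedy_policy(q_table, "s1")
```

The set of candidate actions is the same in both states, but the policy picks a different one in each because the Q-values differ.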

However, we don’t know everything about the world, so we don’t have access to $Q^*$. But, since neural networks are universal function approximators, we can simply create one and train it to resemble $Q^*$.

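However, we cannot know everything about the environment, so the optimal function $Q^*$ is not available to us.

Instead, we use a neural network, which is a universal function approximator, and train it so that its output resembles that of the optimal $Q^*$ function.

resemble ... to be similar to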

For our training update rule, we’ll use a fact that every $Q$ function for some policy obeys the Bellman equation:

$Q^\pi(s,a) = r + \gamma Q^\pi(s', \pi(s'))$

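To derive the training rule, we use the fact that the $Q$ function of any given policy obeys the Bellman equation.

Left-hand side:

$s, a$: state and action at the current time

$Q^\pi(s,a)$: value of the function $Q$ under policy $\pi$

Right-hand side:

$r$: reward obtained by taking action $a$ in state $s$

$Q^\pi(s', \pi(s'))$: value of $Q$ determined by the state $s'$ one time step ahead and the policy $\pi$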
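The Bellman equation can be checked numerically on a tiny deterministic environment. Everything below (the four-state chain, the "stay"/"right" actions, the reward of 1.0 on reaching the terminal state, and $\gamma = 0.5$) is a hypothetical toy setup, not the environment used in the tutorial.

```python
import math

GAMMA = 0.5
TERMINAL = 3  # states are 0, 1, 2, 3; state 3 ends the episode


def step(state, action):
    """Deterministic toy dynamics: return (next_state, reward)."""
    next_state = state + 1 if action == "right" else state
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def policy(state):
    """Fixed policy pi: always move right."""
    return "right"


def q_pi(state, action):
    """Return from taking `action` once in `state`, then following `policy`."""
    total, discount = 0.0, 1.0
    while state != TERMINAL:
        state, reward = step(state, action)
        total += discount * reward
        discount *= GAMMA
        if state != TERMINAL:
            action = policy(state)
    return total


def bellman_rhs(state, action):
    """r + gamma * Q_pi(s', pi(s')), with Q_pi taken as 0 at the terminal state."""
    next_state, reward = step(state, action)
    if next_state == TERMINAL:
        return reward
    return reward + GAMMA * q_pi(next_state, policy(next_state))


# Both sides agree for every non-terminal state and every action
bellman_holds = all(
    math.isclose(q_pi(s, a), bellman_rhs(s, a))
    for s in range(TERMINAL)
    for a in ("stay", "right")
)
```

Here `q_pi` computes the left-hand side by rolling the policy forward and summing discounted rewards, while `bellman_rhs` computes the right-hand side from a single step plus the value at the next state.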

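Interpreting the Bellman equation in plain terms

# Determining a policy $\pi$ that maximizes the cumulative reward from a time $t$ up to the terminal time $t_f$ is a hard problem.

# Instead, solve the maximization problem over the short interval from $t$ to a time $t'$ one small step ahead.

The reward obtained is $r$, and as a result of the action the state transitions to $s'$.

# The choice that is optimal at $t$ is not necessarily optimal at other times, so next take the interval from $t'$ to $t_f$ as the maximization interval.

# The state and cumulative reward at $t'$ are already known, so proceed from them in the same way as before.

# Repeating this procedure far enough in time solves the overall maximization problem in a dynamic-programming (DP) fashion.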
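The step-by-step decomposition above can be sketched as a small dynamic-programming routine. Backward induction over a finite horizon is one common DP formulation and is used here as an assumption; the notes do not specify a particular scheme. The four-state chain, the actions, and $\gamma = 0.5$ are likewise hypothetical.

```python
GAMMA = 0.5
STATES = [0, 1, 2, 3]
ACTIONS = ["stay", "right"]
TERMINAL = 3  # reaching state 3 ends the episode


def step(state, action):
    """Deterministic toy dynamics: return (next_state, reward)."""
    next_state = state + 1 if action == "right" else state
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def optimal_values(horizon):
    """Backward induction: V_k(s) = max_a [ r + gamma * V_{k-1}(s') ]."""
    value = {s: 0.0 for s in STATES}  # V_0: no steps left, nothing to collect
    for _ in range(horizon):
        new_value = {}
        for s in STATES:
            if s == TERMINAL:
                new_value[s] = 0.0
                continue
            # One-step maximization that reuses the already-solved remainder
            candidates = []
            for a in ACTIONS:
                s_next, r = step(s, a)
                candidates.append(r + GAMMA * value[s_next])
            new_value[s] = max(candidates)
        value = new_value
    return value


values = optimal_values(horizon=3)
```

Each pass solves only a one-step maximization and reuses the values already computed for the remaining interval, so the full problem never has to be solved in one shot.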